SG-WRAP: A Schema-Guided Wrapper Generator

Authors

  • Xiaofeng Meng
  • Hongjun Lu
  • Haiyan Wang
  • Mingzhe Gu

Abstract

With the development of the Internet, the World Wide Web has become an invaluable information source for everyone. However, most of the data on the Web is currently in the form of HTML pages, which is neither well structured nor associated with a schema. It is almost impossible to use such data efficiently. Web wrapper technology has been developed to transform unstructured/semi-structured data into semi-structured/structured data, which can be queried and analyzed using mature techniques developed in the database and other fields. The main issues of wrapper generation are (1) to identify the semantics of the data contained in an HTML document and (2) to establish the mappings between its structure and its semantics. Various wrapper generation tools have addressed these issues in different ways [1, 2, 3, 5, 4]. In this paper, we present a wrapper generator, SG-WRAP. It was designed and developed based on the following observations. First, while user interaction is probably the best way to generate wrappers for a specific HTML source quickly and accurately, given the diversity of HTML pages and the limited semantic information expressed in HTML tags, the user's effort should be minimized as much as possible. Second, the ultimate goal of wrapper generation is to transform the original data into a structured form that is easy to consume, rather than to understand the structure of the original data. When a user gathers data from the Web, s/he must have her/his needs in mind. It is often not necessary to generate wrappers for the entire HTML document. Therefore, SG-WRAP adopts a novel, schema-guided approach to wrapper generation. With this approach, a user defines the schema of...
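
The two issues named in the abstract (identifying the semantics of the data and mapping the page structure to those semantics) can be illustrated with a minimal sketch. It is not SG-WRAP's actual algorithm, which the truncated abstract does not describe. The field names, the tag paths, the sample page, and the `PathExtractor` and `extract` helpers are all hypothetical, and the sketch assumes that each schema field corresponds to one fixed tag path.

```python
from html.parser import HTMLParser

# Hypothetical user-defined schema: field name -> tag path in the HTML page.
# Both the field names and the paths are illustrative assumptions.
SCHEMA_MAPPING = {
    "title": ("html", "body", "h1"),
    "price": ("html", "body", "p", "b"),
}


class PathExtractor(HTMLParser):
    """Collects text whose enclosing tag path is mapped to a schema field."""

    def __init__(self, mapping):
        super().__init__()
        self.path_to_field = {path: field for field, path in mapping.items()}
        self.stack = []    # currently open tags, outermost first
        self.record = {}   # extracted field -> text

    def handle_starttag(self, tag, attrs):
        # Note: void elements such as <br> are not handled in this sketch.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        field = self.path_to_field.get(tuple(self.stack))
        text = data.strip()
        if field and text:
            # Keep only the first match for each field.
            self.record.setdefault(field, text)


def extract(html, mapping):
    """Return a dict holding only the fields named in the schema mapping."""
    parser = PathExtractor(mapping)
    parser.feed(html)
    parser.close()
    return parser.record


PAGE = (
    "<html><body><h1>Database Systems</h1>"
    "<p>Price: <b>$42</b></p><p>Ignored footer</p></body></html>"
)

print(extract(PAGE, SCHEMA_MAPPING))
# → {'title': 'Database Systems', 'price': '$42'}
```

Only the parts of the page that the schema asks for are extracted. This reflects the abstract's observation that a wrapper often does not need to cover the entire HTML document.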

Similar resources

SG-WRAM Schema Guided Wrapper Maintenance

The World Wide Web has become one of the most important connections to various sources of information. A large proportion of the data is embedded in HTML documents. This language serves the visual presentation of data in Internet browsers, but does not provide semantic information for the data presented. This form of data presentation is, therefore, inappropriate for the demands of automated, co...

A Supervised Visual Wrapper Generator for Web-Data Extraction

Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper, we propose a novel schema-guided approach to wrapper generation. We provide a user-friendly interface that allows users to define the schema of the data to be extracted and to specify mappings from an HTML page to the target schema. Based on...

An Effective Wrapper Architecture to Heterogeneous Data Source

In this paper, we focus on the problem, in information integration systems, of obtaining data from heterogeneous data sources accurately and effectively. XML is used as the data exchange format of the wrapper. We design the wrapper architecture based on the conversion and management of views as the bridge from the global schema to the local schemas of the various data sources. Our wrapper has two main subsystem...

IWrap: Instant Web Wrapper Generator

In this paper, we describe an automatic Web wrapper generator that creates specification files, which contain the schema information and extraction rules for a class of Web pages. These specification files can then be used by a wrapper engine (e.g. MIT COIN Grenouille) to extract information from semi-structured Web sites. We create specification files through a WYSIWYG GUI with minimal user i...

Reusing (Shrink Wrap) Schemas by Modifying Concept Schemas

We provide mechanisms that facilitate database design based on a shrink wrap schema. A shrink wrap schema is a well-crafted, complete, global schema that represents an application. We develop the notion of concept schemas as a way to decompose shrink wrap schemas. A concept schema is a subset of an application schema that addresses one particular point of view in an application. To aid in shrin...

Publication date: 2002